Project Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

Goal

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
library(stringr)
library(dplyr)
library(quanteda)
library(readtext)
library(R.utils)
library(ggplot2)

set.seed(3301)

Task 0: Understanding the problem

Tasks to accomplish

  1. Obtaining the data - Can you download the data and load/manipulate it in R?
  2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

Questions to consider

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Dwonload data.

source("downloadData.R")

attach(downloadData(file.path("..", "data")))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"  
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"   
## [4] "../data/bad-words.txt"

First, try to processing entire files using our scratch implementation:

tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
while (FALSE && length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
        # Count tweet
        tweets <- tweets + 1
        # Show first 10 tweet
        if(tweets <= 10) {
                print(oneLine)
        }
        # Tokenize by regular expression
        words <- str_split(oneLine, "\\s+")[[1]]
        # To detect symbols like a ':)', initialise variable
        symbols <- rep(FALSE, length = length(words))
        # Each token:
        for(i in 1:length(words)) {
                # Extract token that has only symbol string
                symbols[i] <- grepl("^[^a-zA-Z0-9]+$", words[i])
                # numbers, aggregate in '[numbers]'
                if(grepl("^[0-9]+$", words[i])) {
                        words[i] <- "[numbers]"
                }
        }
        # Tokens
        wordsPerLine <- length(simpleWords <- words[!symbols])
        # Count tokens ending with punctuation as the number of sentences
        for(i in 1:length(simpleWords)){
                if(grepl("[.!?]$", simpleWords[i])) {
                        sentencesTwitter <- sentencesTwitter + 1
                }
        }
        wordsTwitter <- wordsTwitter + wordsPerLine
        remove(simpleWords, words)
}
close(con)

tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"

## [1] 2360148
## [1] 29706404
## [1] 2818583

It takes a long time to calculate, so re-implement it using the package.

Loading files using the readtext package.

tweetFile <- readtext(twitter)
corpusTwitter <- corpus(tweetFile, cache = FALSE)
summary(corpusTwitter)
## Corpus consisting of 1 document:
## 
##               Text  Types   Tokens Sentences
##  en_US.twitter.txt 566951 36719658   2588551
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Mon Jul 30 04:24:29 2018
## Notes:

Task 1: Getting and cleaning the data

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words you do not want to predict.

Tips, tricks, and hints

  1. Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:con <- file(“en_US.twitter.txt”, “r”) readLines(con, 1) ## Read the first line of text readLines(con, 1) ## Read the next line of text readLines(con, 5) ## Read in the next 5 lines of text close(con) ## It’s important to close the connection when you are done See the ?connections help page for more information.

  2. Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.

Sub-Sampling

tweets <- as.numeric(countLines(twitter))
twitterSubSampling <- paste0(twitter, ".sub-sampling.txt")
if(!file.exists(twitterSubSampling)) {
        subSamplingRate <- .01
        flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingRate)
        conRead <- file(twitter, "r")
        conWrite <- file(twitterSubSampling, "w")
        len <- 0
        while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
                len <- len + 1
                if(flipABiasedCoin[len] == 1) {
                        writeLines(oneLine, conWrite)
                }
        }
        close(conRead)
        close(conWrite)
}

subTweets <- as.numeric(countLines(twitterSubSampling))
subTweets
## [1] 23662

Tokenization

subTweetFile <- readtext(twitterSubSampling)
subTwitterCorpus <- corpus(subTweetFile, cache = FALSE)
summary(subTwitterCorpus)
## Corpus consisting of 1 document:
## 
##                                Text Types Tokens Sentences
##  en_US.twitter.txt.sub-sampling.txt 34531 368276     26039
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Mon Jul 30 04:26:15 2018
## Notes:

Load Bad Words

profanity <- readLines(badwords)

Task 2: Exploratory Data Analysis

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
Field Unit Sample sequence 1-gram sequence 2-gram sequence 3-gram sequence
Computational linguistics word … to be or not to be … …, to, be, or, not, to, be, … …, to be, be or, or not, not to, to be, … …, to be or, be or not, or not to, not to be, …

Top 20

subTweetsDfm <- subTwitterCorpus %>% 
        # nomarize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        # tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity)

topfeatures(dfm(subTweetsDfm), 20)
##  the   to    i    a  you  and  for   in   of   is   it   my   on that   me 
## 9228 7770 7260 6194 5443 4470 3890 3725 3658 3475 3074 2988 2741 2292 2100 
##   be   at with have your 
## 1857 1816 1728 1712 1711

Plot Word Cloud

dfm(subTweetsDfm) %>% 
#        dfm_trim(min_termfreq = 10,
#                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE, 
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Nomarize Words

subTweetsDfmNomarized <- subTwitterCorpus %>% 
        # nomarize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity)

Top 20 Nomarized Words

topfeatures(dfm(subTweetsDfmNomarized), 20)
##   just   like    get   love   good thanks    day    can    now     rt 
##   1528   1218   1182   1060   1058    918    917    905    871    862 
##    one  great   know    new   time      u  today     go    see    lol 
##    803    782    780    762    757    741    741    694    682    673

Plot Word Cloud

dfm(subTweetsDfmNomarized) %>%
#        dfm_trim(min_termfreq = 10,
#                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Frequency Plots

featuresTweetsDfm <- textstat_frequency(dfm(subTweetsDfmNomarized), n = 80)

# Sort by reverse frequency order
featuresTweetsDfm$feature <- with(featuresTweetsDfm, reorder(feature, -frequency))

ggplot(featuresTweetsDfm, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

2-Gram Top 20.

subTweetsDfmNomarized2Gram <- subTwitterCorpus %>% 
        # nomarize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        # tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity) %>%
        tokens_ngrams(n = 2)
topfeatures(dfm(subTweetsDfmNomarized2Gram), 20)
##     in_the    for_the     of_the     on_the      to_be thanks_for 
##        777        714        592        466        453        446 
##     to_the     i_love  thank_you     at_the     if_you     i_have 
##        430        374        363        360        344        317 
##       i_am      for_a     have_a   going_to     to_see    will_be 
##        316        314        313        303        285        277 
##       is_a     to_get 
##        269        263

2-Gram Word Cloud

dfm(subTweetsDfmNomarized2Gram) %>%
#        dfm_trim(min_termfreq = 10,
#                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

2-Gram Frequency Plots

featuresTweetsDfm2Gram <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram), n = 80)

# Sort by reverse frequency order
featuresTweetsDfm2Gram$feature <- with(featuresTweetsDfm2Gram, reorder(feature, -frequency))

ggplot(featuresTweetsDfm2Gram, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

3-Gram Top 20

subTweetsDfmNomarized3Gram <- subTwitterCorpus %>% 
        # nomarize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        #tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity) %>%
        tokens_ngrams(n = 3)
topfeatures(dfm(subTweetsDfmNomarized3Gram), 20)
##       thanks_for_the        thank_you_for   looking_forward_to 
##                  231                   91                   90 
##           i_love_you        can't_wait_to       for_the_follow 
##                   88                   82                   76 
##            i_want_to           one_of_the          going_to_be 
##                   72                   65                   61 
##           to_see_you         have_a_great             a_lot_of 
##                   61                   59                   58 
##            i_need_to            i_have_to         i'm_going_to 
##                   53                   51                   49 
##          is_going_to          you_want_to thanks_for_following 
##                   47                   46                   45 
##          how_are_you          let_me_know 
##                   44                   44

3-Gram Word Cloud

dfm(subTweetsDfmNomarized3Gram) %>%
        textplot_wordcloud(#min_count = 4,
                           random_order = FALSE,
                           max_words = 50,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

3-Gram Frequency Plots

featuresTweetsDfm3Gram <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram), 60)

# Sort by reverse frequency order
featuresTweetsDfm3Gram$feature <- with(featuresTweetsDfm3Gram, reorder(feature, -frequency))

ggplot(featuresTweetsDfm3Gram, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

1-gram 90%tile:

featuresTweetsDfmFull <- textstat_frequency(dfm(subTweetsDfmNomarized))
summary(featuresTweetsDfmFull)
##    feature            frequency             rank          docfreq 
##  Length:25808       Min.   :   1.000   Min.   :    1   Min.   :1  
##  Class :character   1st Qu.:   1.000   1st Qu.: 6453   1st Qu.:1  
##  Mode  :character   Median :   1.000   Median :12904   Median :1  
##                     Mean   :   6.404   Mean   :12904   Mean   :1  
##                     3rd Qu.:   3.000   3rd Qu.:19356   3rd Qu.:1  
##                     Max.   :1528.000   Max.   :25808   Max.   :1  
##     group          
##  Length:25808      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfmFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    9 1528

2-gram 90%tile:

featuresTweetsDfm2GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram))
summary(featuresTweetsDfm2GramFull)
##    feature            frequency            rank           docfreq 
##  Length:161608      Min.   :  1.000   Min.   :     1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.: 40403   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median : 80804   Median :1  
##                     Mean   :  1.808   Mean   : 80804   Mean   :1  
##                     3rd Qu.:  1.000   3rd Qu.:121206   3rd Qu.:1  
##                     Max.   :777.000   Max.   :161608   Max.   :1  
##     group          
##  Length:161608     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfm2GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    2  777

3-gram 90%tile:

featuresTweetsDfm3GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram))
summary(featuresTweetsDfm3GramFull)
##    feature            frequency            rank           docfreq 
##  Length:260218      Min.   :  1.000   Min.   :     1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.: 65055   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median :130110   Median :1  
##                     Mean   :  1.123   Mean   :130110   Mean   :1  
##                     3rd Qu.:  1.000   3rd Qu.:195164   3rd Qu.:1  
##                     Max.   :231.000   Max.   :260218   Max.   :1  
##     group          
##  Length:260218     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfm3GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    1  231

Seems like Zipf’s law.

Task 3: Modeling

Tasks to accomplish

  1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
  2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

Questions to consider

  1. How can you efficiently store an n-gram model (think Markov Chains)?
  2. How can you use the knowledge about word frequencies to make your model smaller and more efficient?
  3. How many parameters do you need (i.e. how big is n in your n-gram model)?
  4. Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ?
  5. How do you evaluate whether your model is any good?
  6. How can you use backoff models to estimate the probability of unobserved n-grams?

Good Turing Estimation

simpleGoodTuring <- function(r, Nr) {
        # number of words
        N <- sum(r * Nr)
        d <- diff(r)
    
        ## Turing estimate
        # turing estimate index
        ti <- which(d == 1)
        # discount coefficients of Turing estimate
        dct <- numeric(length(r))
        dct[ti] <- (r[ti] + 1) / r[ti] * c(Nr[-1], 0)[ti] / Nr[ti]

        ## Linear Good-Turing estimate
        Zr <- Nr / c(1, 0.5 * (d[-1] + d[-length(d)]), d[length(d)])
        f <- lsfit(log(r), log(Zr))
        coef <- f$coef
        # corrected term frequency
        rc <- r * (1 + 1 / r)^(1 + coef[2])  
        # discount coefficients of Linear Good-Turing estimate
        dclgt <- rc / r

        ## make switch from Turing to LGT estimates
        # standard deviation of term frequencies between 'r' and 'rc' (?)
        rsd <- rep(1,length(r))        
        rsd[ti] <- (seq_len(length(r))[ti] + 1) / Nr[ti] * sqrt(Nr[ti + 1] * (1 + Nr[ti + 1] / Nr[ti]))
        
        dc <- dct
        for (i in 1:length(r)) {
            if (abs(dct[i] - dclgt[i]) * r[i] / rsd[i] <= 1.65) {
                dc[i:length(dc)] <- dclgt[i:length(dc)]
                break
            }
        }

        ## renormalize the probabilities for observed objects
        # summation of probabilities
        sump <- sum(dc * r * Nr) / N
        # renormalized discount coefficients
        dcr <- (1 - Nr[1] / N) * dc / sump
        
        # term frequency
        tf <- c(Nr[1] / N, r * dcr)
        p <- c(Nr[1] / N, r * dcr / N)
        names(p) <- names(tf) <- c(0, r)        
        
        list(p = p, r = tf)
}

Basic 2-gram Model:

NrTbl2 <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram)) %>%
        select(frequency) %>%
        mutate(freqOfFrequency = 1) %>%
        group_by(frequency) %>%
        summarise_all(sum)

Nr2 <- NrTbl2$freqOfFrequency
r2 <- NrTbl2$frequency

sgt2 <- simpleGoodTuring(r2, Nr2)

dBigram <- function(freq) {
        sgt2$r[as.character(freq)] / freq
}

nextWords2Gram <- function(input, outputs = 3, k = 0) {
        # k is the least important of the parameters. It is usually chosen to be 0.
        # However, empirical testing may find better values for k.
        featuresNextWord <- NULL

        # extract n-gram that starts with input
        nextWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
                                         phrase(paste0(
                                                 input, "_*"
                                         ))))

        if (length(nextWordDfm) > k) {
                prevWordDfm <- dfm(tokens_select(subTweetsDfm,
                                                 phrase(input)))
                prevWordFreq <- textstat_frequency(prevWordDfm)$frequency

                # data frame
                featuresNextWord <-
                        textstat_frequency(nextWordDfm) %>%
                        mutate(p_bo = dBigram(frequency) * frequency / prevWordFreq)

                # human readable outputs
                featuresNextWord$feature <-
                        sapply(as.vector(featuresNextWord$feature),
                               function(x) {
                                       str_split(x, "_")[[1]][2]
                               })
                
                # Sort by reverse frequency order
                featuresNextWord$feature <-
                        with(featuresNextWord,
                             reorder(feature,-p_bo))
                
        } else {
                trigramDfm <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram)) %>%
                        filter(frequency > k)
                prevWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
                                                 phrase(triGram)))
                prevWordFreq <- textstat_frequency(prevWordDfm)$frequency

                # data frame
                featuresNextWord <-
                        trigramDfm %>%
                        mutate(p_bo = dBigram(frequency) * frequency / prevWordFreq)

                beta <- 1 - sum(featuresNextWord$freqency)
                featuresNextWord <- nextWords2Gram(biGram, outputs = outputs, k = k)
        }

        featuresNextWord %>% slice(1:outputs)
}

Next word of Looking is:

ggplot(nextWords2Gram("Looking"), aes(x = feature, y = p_bo)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

Next word of forward is:

ggplot(nextWords2Gram("forward"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be

I went

ggplot(nextWords2Gram("went"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to

ggplot(nextWords2Gram("to"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be

ggplot(nextWords2Gram("be"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be a

ggplot(nextWords2Gram("a"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be a great

ggplot(nextWords2Gram("great"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be a great day …

3-gram Model using Katz’s back-off model

NrTbl3 <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram)) %>%
        select(frequency) %>%
        mutate(freqOfFrequency = 1) %>%
        group_by(frequency) %>%
        summarise_all(sum)

Nr3 <- NrTbl3$freqOfFrequency
r3 <- NrTbl3$frequency

sgt3 <- simpleGoodTuring(r3, Nr3)

dTrigram <- function(freq) {
        sgt3$r[as.character(freq)] / freq
}

nextWords <- function(input, ngram = 3, outputs = 3, k = 0) {
        # k is the least important of the parameters. It is usually chosen to be 0.
        # However, empirical testing may find better values for k.
        inputs <- str_split(input, "\\s+")[[1]]
        inputsSize <- length(inputs)
        if (inputsSize == 1) { return(nextWords2Gram(input)) }
        triGram <- paste(inputs[inputsSize - 1],
                         inputs[inputsSize],
                         sep = "_")
        biGram <- inputs[inputsSize]
        featuresNextWord <- NULL

        # extract n-gram that starts with input
        nextWordDfm <- dfm(tokens_select(subTweetsDfmNomarized3Gram,
                                         phrase(paste0(
                                                 triGram, "_*"
                                         ))))

        if (length(nextWordDfm) > k) {
                prevWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
                                                 phrase(triGram)))
                prevWordFreq <- textstat_frequency(prevWordDfm)$frequency

                # data frame
                featuresNextWord <-
                        textstat_frequency(nextWordDfm) %>%
                        mutate(p_bo = dTrigram(frequency) * frequency / prevWordFreq)

                # human readable outputs
                featuresNextWord$feature <-
                        sapply(as.vector(featuresNextWord$feature),
                               function(x) {
                                       str_split(x, "_")[[1]][3]
                               })
                
                # Sort by reverse frequency order
                featuresNextWord$feature <-
                        with(featuresNextWord,
                             reorder(feature,-p_bo))
                
        } else {
                trigramDfm <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram)) %>%
                        filter(frequency > k)

                prevWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
                                                 phrase(triGram)))
                if(length(prevWordDfm) != 0) {
                        prevWordFreq <- textstat_frequency(prevWordDfm)$frequency
        
                        featuresNextWord <-
                                trigramDfm %>%
                                mutate(p_bo = dBigram(frequency) * frequency / prevWordFreq)
        
        
                        beta <- 1 - sum(featuresNextWord$freqency)
        
                        p <- sum(nextWords2Gram(biGram, outputs = outputs, k = k)$p_bo)
        
                        alpha <- beta / p
                        print(alpha)
                        featuresNextWord <- nextWords2Gram(biGram, outputs = outputs, k = k) %>%
                                mutate(p_bo = p_bo * alpha)
                } else {
                        featuresNextWord <- nextWords2Gram(biGram, outputs = outputs, k = k)
                }

        }
        
        featuresNextWord %>% slice(1:outputs)
}

I went to be

ggplot(nextWords("I went to be"), aes(x = feature, y = p_bo)) +
        geom_bar(stat = "identity") + 
        xlab("Next word") + ylab("P_bo")

ggplot(nextWords("w a a"), aes(x = feature, y = p_bo)) +
        geom_bar(stat = "identity") + 
        xlab("Next word") + ylab("P_bo")

References